Prerequisites

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see http://rmarkdown.rstudio.com or the YouTube tutorial (in French).

A cheat-sheet can be found here

Preparation of the workspace

R & RStudio install

R & the IDE RStudio can be installed from this web page: https://posit.co/download/rstudio-desktop/

  1. Download, then install R
  2. Download, then install RStudio

A cheat-sheet for RStudio can be found here.

Installing a package:

Packages can be installed with the command install.packages() (don’t forget the quotation marks " "). Example for the ggplot2 package:

# Install from CRAN
install.packages("ggplot2")

To see all the packages that are installed, use the code below:

# Installed packages
installed.packages()

Using a package

After the installation of the desired package, it is necessary to load it in R with the library() command:

library("ggplot2")
library("tibble")

Remark: some libraries need compilation; for this reason, each library should be loaded one by one.

Another way is to put all the desired libraries in a vector (see below), then load them with the function lapply as shown below:

x <- c("readr","tibble","tidyr","ggplot2","dplyr", "factoextra",
       "FactoMineR", "rgl", "colorspace","ggthemes","ggpubr", "readxl")
lapply(x, require, character.only = TRUE)

Work in a Working Directory

Two ways:

  • setwd("url_of_the_directory")
  • In the “Files” pane, click “More” -> select “Set As Working Directory”

To check that we are in the right directory:

  • getwd() => returns the absolute filepath of the current working directory

To list the files and directories in the current directory:

  • list.files()

The basics

Good resources: https://bookdown.org/ael/rexplor/

Be careful ==> R is case-sensitive, so A and a are different

Assigning a variable:

In R, a variable can be assigned with two symbols, <- OR =, but the first one is the most used.
To store a “list” of numbers, you need to create a vector with c(), using , as separator.
Then, thanks to the function class(), you can see the class of the variable.

x <- 20
l <- c(10, 44, 89)
s <- "I am a character"
bo <- TRUE # a boolean

paste("Variable x:",x)
## [1] "Variable x: 20"
class(x)
## [1] "numeric"
paste("Variable l:",l)
## [1] "Variable l: 10" "Variable l: 44" "Variable l: 89"
class(l)
## [1] "numeric"
paste("Variable s:", s)
## [1] "Variable s: I am a character"
class(s)
## [1] "character"
paste("Variable bo:", bo)
## [1] "Variable bo: TRUE"
class(bo)
## [1] "logical"

Be careful
Some variable names are already used by the software (examples: q, T, F, D, I, var, mean…) ==> Do not use them!

The command paste() is used to concatenate strings and variables. A space is automatically included as separator. To change it, add the argument sep="" for no space, or any other symbol. It is also possible to use paste0(), which uses no separator.
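A quick illustration of the separator behaviour (this small sketch is not from the original text):

```r
paste("a", "b")             # "a b"  : default separator is a space
paste("a", "b", sep = "")   # "ab"   : no separator
paste("a", "b", sep = "-")  # "a-b"  : custom separator
paste0("a", "b")            # "ab"   : same as sep = ""
```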

For any help, use the command help() or the symbol ?. Example:

help(class)
?ggplot2

To remove an object: rm(). At the beginning of a session/project, it can be useful to clean all the data kept by RStudio in its memory:

rm(list = ls())

Using vectors

Calling c() does not create a real list but a vector.
For a list, the function list() must be called:

li <- list(10, 44, 89)
li
## [[1]]
## [1] 10
## 
## [[2]]
## [1] 44
## 
## [[3]]
## [1] 89
class(li)
## [1] "list"

A vector cannot hold NULL values: anything NULL is simply dropped. The list, on the other hand, keeps NULL elements:

vector <- c(1, 5, NULL)
liste <- list(1, 5, "u", NULL)
vector
## [1] 1 5
liste
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] "u"
## 
## [[4]]
## NULL

Also, the list can combine different types of objects, whereas the vector cannot mix numeric and character ==> all the elements in the vector become strings:

vector <- c(1, 5, "u")
liste <- list(1, 5, "u")
vector
## [1] "1" "5" "u"
liste
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] "u"

To generate a vector of the numbers between 2 limits (limits are included):

  • vector1 <- c(50:100)
  • vector1 <- seq(from=0, to=5, by=0.5) : increments of 0.5
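For example (values chosen for this sketch, not from the original text):

```r
v_int <- 50:100                            # integers from 50 to 100, limits included
v_seq <- seq(from = 0, to = 5, by = 0.5)   # 0.0, 0.5, 1.0, ..., 5.0
length(v_int)  # 51
length(v_seq)  # 11
```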

The vectors can be combined:

v1 <- rep(x = c(1,5), each=3)  # each value is repeated 3 times in a row
v2 <- rep(x = c(1,5), times=3) # the whole vector is repeated 3 times
v3 <- c(v1, v2)
v1
## [1] 1 1 1 5 5 5
v2
## [1] 1 5 1 5 1 5
v3
##  [1] 1 1 1 5 5 5 1 5 1 5 1 5

Use the function sample() to generate random values

lances <- sample(x=c("pile", "false"), size=10, replace=TRUE, prob=c(0.4,0.6))
lances
##  [1] "false" "pile"  "pile"  "false" "false" "pile"  "pile"  "false" "false"
## [10] "false"
  • size gives the number of trials
  • replace=TRUE allows sampling with replacement (FALSE by default)
  • prob defines the probability of appearance of each value

Extracting from a vector

In a vector, each element can be located by:

  • its index ==> the rank of the element; in R, indexing starts at 1
  • its label ==> the name of the element, set with the function names(vector); labels must be defined after the creation of the vector
notes <- c(12,15,8,9,11)
names(notes) <- c("Villon", "Polin", "Exfi","Rotaf","Zerif")

Data extraction is done with the [] symbol:

notes[1]
## Villon 
##     12
notes["Villon"]
## Villon 
##     12

Manipulating numbers & characters

  • round(x, digits = y) ==> rounds the number x to y digits; if the result is assigned back to the variable, the rounded value is kept in memory

  • strsplit(x="pocket.365_rt", split="_") ==> splits the string x according to the split argument. Careful: split is interpreted as a regular expression, so the dot “.” matches any character and splits everything; use split="\\." or fixed=TRUE to split on a literal dot

  • nchar ==> number of characters

  • toupper, tolower ==> convert the string to upper / lower case

  • grep(pattern, x) ==> gives the positions of the elements of x that match pattern
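A short sketch of these functions (the example strings are hypothetical, not from the original text):

```r
round(3.14159, digits = 2)                                # 3.14
strsplit(x = "pocket.365_rt", split = "_")                # "pocket.365" "rt"
strsplit(x = "pocket.365_rt", split = ".", fixed = TRUE)  # "pocket" "365_rt" : literal dot
nchar("pocket")                                           # 6
toupper("pocket")                                         # "POCKET"
grep("rt", c("pocket_rt", "pocket_uv"))                   # 1 : index of the matching element
```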

Dataset Manipulation

Matrix and dataframe

A 2-dimensional set of elements divided into rows and columns:

  • Matrix ==> accepts a unique type for all columns
  • Data frame ==> accepts a different type per column

L3 <- LETTERS[1:3]
fac <- sample(x=L3, size=10, replace=TRUE)
df1 <- data.frame(x = rep(x = 1, times=length(fac)), y=1:10, fac_letter=fac)
df1

Rows and columns can be labeled afterwards using colnames() and rownames().

mat1 <- matrix(c(rexp(n=5, rate=2),
               rexp(n=5, rate=2.5),
               rexp(n=5, rate=1.8)),
               nrow=5, ncol=3) # 5 draws per column: 15 values fill the 5 x 3 matrix
colnames(mat1) <- paste("cond", 1:ncol(mat1), sep="_")
rownames(mat1) <- paste("id", 1:nrow(mat1), sep="_")
mat1
##         cond_1    cond_2     cond_3
## id_1 0.2493434 0.1687926 0.97938653
## id_2 0.2172397 2.2540553 0.49918140
## id_3 0.7634651 0.5221910 0.69246513
## id_4 0.3398403 0.1969599 0.48204954
## id_5 0.1129144 0.2931617 0.07123217

Importing a csv file

To import an Excel file, the library readxl needs to be installed.

csv files are directly imported as a data.frame. A data set can be displayed thanks to the command View().

smp <- read.csv2("~/code/R/FUN/fichier 1/smp1.csv")
View(smp) # note the upper-case V

Some basic manipulations

To obtain a particular value inside the data.frame, use brackets, with row and column separated by a comma: df[index_row, index_column]

df1[4,2]
## [1] 4
df1[,"fac_letter"]
##  [1] "B" "A" "A" "B" "B" "B" "B" "A" "C" "A"
df1$fac_letter # same result as the previous one, but on rare occasions it can cause issues (partial name matching)
##  [1] "B" "A" "A" "B" "B" "B" "B" "A" "C" "A"
df1[4,]
df1[c(2,5:9), 2]
## [1] 2 5 6 7 8 9

To see the first 6 observations of all the variables, use the command head().

# Utilization of the Iris data set, directly included in R
class(iris)
## [1] "data.frame"
head(iris)

The command which() gives the positions of the observations that satisfy the condition:

  • == for an equality
  • != for a difference
  • <= / >= less/greater than or equal to
  • < / > strictly less/greater than
  • & / | AND / OR ==> for multiple conditions
  • isTRUE(x) tests if x is TRUE
  • na.rm=TRUE to remove the empty values
which(iris$Species != "setosa")
##   [1]  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [19]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
##  [37]  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104
##  [55] 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
##  [73] 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
##  [91] 141 142 143 144 145 146 147 148 149 150

To see the actual values:

head(iris[which(iris$Sepal.Length>5),])

To see the unique values of a column, use the function unique():

unique(iris$Species)
## [1] setosa     versicolor virginica 
## Levels: setosa versicolor virginica
length(unique(iris$Species)) # Give the number of unique values
## [1] 3

The command table() groups and counts the observations of a given variable.
The command subset(data.frame, condition, c(vector of variables)) takes the observations that satisfy the condition and keeps only the requested variables.

table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
table(iris$Species != "setosa")
## 
## FALSE  TRUE 
##    50   100
table(subset(iris, Species == "setosa")$Petal.Length > 1.4)
## 
## FALSE  TRUE 
##    24    26
head(subset(iris, Species == "setosa", c(Sepal.Length, Petal.Length)))

Make a tibble instead of a data.frame

Load the library tibble. It allows the creation of a tibble (table), which serves the same purpose as a data.frame but with more restrictions: a tibble must have the same length for every column.

library(tibble)
# creation of the dataset
# m1 a list with Null values
m1<-list(1.311,1.287,1.293,1.308,1.291,1.300,1.274,1.287)
m1 <- append(m1, vector("list",5)) # append 5 NULL elements contained in a list
m2<-c(1.298,1.309,1.293,1.251,1.338,1.302,1.270,1.339,1.346,1.292,1.291,1.321,1.285)
Si<-tibble("method1"=m1,"method2"=m2)

Using functional programming

Functional programming is a paradigm in which programs are built declaratively, by applying and composing functions. In R, the tibble library (and the other tidyverse packages) allows this kind of programming through the pipe symbol, with the construction variable %>% function(). The functional style is a better way to write a comprehensible workflow:

# recover the list containing the method-1 values padded with NULL, then convert it to a vector; unlist() drops the NULL values
m1<-Si$method1 %>% unlist()
m1
## [1] 1.311 1.287 1.293 1.308 1.291 1.300 1.274 1.287
class(m1)
## [1] "numeric"

Manipulation

The libraries tidyr & dplyr

These libraries are useful to manipulate a data set.

For example, in the first table, it is assumed that each sample has been measured with both methods:

  • sample 1 : value m1 / value m2
  • sample 2 : value m1 / value m2

Here, however, the measurements were performed independently of the samples, so the first table does not fit. It is better that each value receives a tag corresponding to its origin (method1 or method2). The code below builds this kind of table:

library(dplyr)
library(tidyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Si2<-tibble(method=c(rep("method1",8),rep("method2",13)),values=c(m1,m2))

# Let's simulate a randomized acquisition of the dataset
Si2<-Si2[sample(nrow(Si2)),] # Indexes are generated by random pick, the tibble is shuffled
head(Si2)

Other manipulation tricks

# Transposition without some columns
Tab.transpo <- data.frame(t(Tab.Atranspo[,-c(1:3)]))

# Put the name of the line into the 1st column (String)
Tab.transpo <- rownames_to_column(Tab.transpo, "NameColumn") 

# reformat as numeric
Tab.aDeriver$variable_for_derivation <- as.numeric(Tab.aDeriver$variable_for_derivation)

## Function of the first derivative

FD2 <- data.frame(apply(Tab.aDeriver[,-1], 2, function(x){ # apply works like a loop: the function is applied to each column (2 = columns, 1 = rows)
            fd <- predict(sm.spline(Tab.aDeriver$variable_for_derivation,
                                    x), Tab.aDeriver$variable_for_derivation, 2) # first derivative at all points (sm.spline comes from the pspline package)
            return(fd) # returns a data frame with the derivative values at each point
        }))

Play with the tidyr & dplyr libraries and functional programming

These libraries make the code more “readable” thanks to the %>% symbol.

m1b<-Si2 %>% filter(method == "method1") %>%
  select(values) %>%
  unlist()
m1
## [1] 1.311 1.287 1.293 1.308 1.291 1.300 1.274 1.287
m1b
## values1 values2 values3 values4 values5 values6 values7 values8 
##   1.300   1.311   1.287   1.287   1.291   1.308   1.293   1.274

Cheat-sheet for more information on those libraries:

https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Statistics operations

A quick view of the global descriptive statistics with the function summary():

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Note: iris is a dataset already included in R

Basic operations:

sum(iris$Sepal.Length) # Sum of all the observations of the variable
## [1] 876.5
mean(iris$Sepal.Length) # Average
## [1] 5.843333
sd(iris$Sepal.Length) # Standard deviation
## [1] 0.8280661
q <- quantile(iris$Sepal.Length) # Get all the quartiles 
q
##   0%  25%  50%  75% 100% 
##  4.3  5.1  5.8  6.4  7.9
q[1]
##  0% 
## 4.3
length(iris$Sepal.Length) # Number of observations in the variable
## [1] 150

Confidence interval

The confidence interval at 95% : \(IC=\bar{x}\pm1.96*se\) with:

  • \(se\) the standard error ==> the standard deviation of the mean
  • \(se=\frac{\sigma}{\sqrt{n}}\) with \(\sigma\) the standard deviation and \(n\) the number of observations
  • The \(1.96\) corresponds to the quantile of the standard normal law for a confidence \(\alpha\) of 95%, taken at the tail probability \(\gamma=\frac{1-\alpha}{2} = \frac{1-0.95}{2}=0.025\)
  • R can calculate this quantile with: qnorm(y) & qnorm(1-y)
y <- (1-0.95)/2
y
## [1] 0.025
signif(qnorm(y),3)
## [1] -1.96
signif(qnorm(1-y),3)
## [1] 1.96

The standard error is the standard deviation of the distribution of the mean, whereas the standard deviation describes the distribution of the individual observations of a normal law.

x <- iris$Sepal.Length
x_N<-length(x)
x_sd<-sd(x)
x_mu<-mean(x)
x_se<-x_sd/sqrt(x_N)
x_ICmin<-x_mu+x_se*qnorm(0.025)
x_ICmax<-x_mu+x_se*qnorm(0.975)
paste(signif(x_ICmin,3),"<",signif(x_mu,3),"<",signif(x_ICmax,3))
## [1] "5.71 < 5.84 < 5.98"

The command signif(variable, number) prints the variable with the requested number of significant digits.
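For example (values chosen for this sketch, not from the original text):

```r
signif(123.456, 2)    # 120    : 2 significant digits
signif(0.0123456, 3)  # 0.0123 : 3 significant digits
signif(5.843333, 3)   # 5.84
```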


Evaluation of probability laws

With R, it is easy to evaluate a probability law:

x & q are vectors of values to be evaluated by mylaw: a single value for one evaluation, several values, or an interval to draw a graph, for example. p is a probability of mylaw, and n the number of observations to generate.

Below, the 3 main laws:

Normal Law (Gauss Law)

The Normal law is used to describe continuous variables (time, temperature…) for which it is impossible to predict the exact value k of a variable X. The reason is that the probability of giving precisely the right value is close to 0, because there is an infinity of possibilities. For example, it is impossible to predict that the temperature in room 401N will be exactly 20°C, because there is an infinity of possibilities between 19.9°C and 20.1°C. However, the behaviour of the variable X can be described thanks to the probability density function (PDF). With this description, it is possible to deduce the probability over a given interval, represented by the area under the PDF curve over this interval. This probability that X lies in the interval [a,b], \(\mathbb{P}(X\in[a,b])\), is given by the cumulative distribution function (CDF), which is the primitive of the PDF:

  • PDF: \(f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\)
  • CDF: \(\mathbb{P}(X\in[a,b])=\int_{a}^{b}f(x)dx\)

Where \(\mu\) is the mean and \(\sigma\) the standard deviation, which define the normal law.

my_mu<-20 # mean value of the normal law
my_sg<-sqrt(20) # std dev of the normal law
dnorm(20,mean=my_mu,sd=my_sg) # point evaluation of the density at x=20 of a normal law centered at my_mu and width sigma=my_sg
## [1] 0.08920621
pnorm(20,mean=my_mu,sd=my_sg) # point evaluation of the CDF at q=20 of this same normal distribution
## [1] 0.5
qnorm(0.05,mean=my_mu,sd=my_sg) # quantile of this normal distribution considering a probability p=0.05
## [1] 12.64399
rnorm(10,mean=my_mu,sd=my_sg) # generate 10 random numbers following this normal distribution
##  [1] 25.49505 14.72169 16.11864 22.73180 18.75411 24.97030 12.22927 22.98842
##  [9] 18.63697 16.87710
Graph of the PDF (right) and the CDF (left)

The Cumulative Distribution Function has remarkable properties:

  • always positive
  • always increasing
  • ranges from 0 to 1
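These properties can be checked numerically (a quick illustration, not from the original text):

```r
# evaluate the standard normal CDF on a few increasing points
p <- pnorm(c(-Inf, -1, 0, 1, Inf), mean = 0, sd = 1)
p                  # 0.000 0.159 0.500 0.841 1.000 (rounded)
all(diff(p) >= 0)  # TRUE : the CDF is increasing, from 0 to 1
```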

Student distribution

One day may be

Binomial law

The Binomial law can be positive or negative (negative binomial), but in the general case only the positive one is used. This is the law followed by a random selection when there are ONLY TWO possible results: A or B, alive or dead. The probability of obtaining each event must be constant (but not necessarily equal). The probability of obtaining event A is \(p\), and event B \(q=(1-p)\). So, with \(n\) draws, the probability of obtaining k events A, \(\mathbb{P}(X=k)\), follows the binomial law \(B(n,p)\). The mean \(\mu\) corresponds to the expectation \(\mathbb{E}\) (espérance in French) and is given by \(\mu=np\); the variance is given by \(var=npq\).

n<-30 # number of trials
p<-0.1 # probability of success
dbinom(x=1,size=n,prob=p) # Probability that I win exactly 1 time in 30 plays
## [1] 0.1413039
pbinom(q=2,size=n,prob=p) # CDF, probability to win 0,1 or 2 times in 30 plays
## [1] 0.4113512
qbinom(p=0.5,size=n,prob=p) # Quantile of 0.5 from the binomial law
## [1] 3
rbinom(n=10,size=n,prob=p) # Sample of size 10 from the binomial law
##  [1] 0 6 6 4 4 8 3 3 1 5

The formula of the binomial law is:

\[\mathbb{P}(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

Where \(\binom{n}{k}\) is the binomial coefficient (n choose k); it can be calculated with the command choose(n,k) or by \(\binom{n}{k} =C_n^{k}= \frac{n!}{k!(n-k)!}\)
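A quick numerical check of this formula (the small values n=5, k=2 are chosen for this sketch, not taken from the text):

```r
choose(5, 2)                                  # 10
factorial(5) / (factorial(2) * factorial(3))  # 10 : same result by the formula
# the binomial formula reproduces dbinom:
choose(5, 2) * 0.5^2 * (1 - 0.5)^3            # 0.3125
dbinom(x = 2, size = 5, prob = 0.5)           # 0.3125
```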

Poisson law

The Poisson law is used to describe the probability of a rare event in an interval of time. It is an approximation of the binomial law when \(p\) tends to zero (very low probability) and \(n\) tends to \(\infty\) (many repetitions). This law takes the parameter \(\lambda = np\), which represents both the expectation \(\mathbb{E}\) and the variance (logical, since \(p\rightarrow0\Rightarrow q\rightarrow1\) in the binomial variance \(npq\)). That is why the Poisson law is written \(P(\lambda)\): \(\lambda\) represents the mean number of realizations of the rare event \(k\).

n<-30 # number of observations
m <- 20 #  Poisson law parameter
dpois(x=1, lambda =m ) # probability of exactly 1 event
## [1] 4.122307e-08
ppois(q=2, lambda =m ) # CDF: probability of 0, 1 or 2 events
## [1] 4.55515e-07
qpois(p=0.5, lambda =m ) # quantile of 0.5 from the Poisson law
## [1] 20
rpois(n=n, lambda =m ) # sample of size n from the Poisson law
##  [1] 25 19 17 26 21 21 17 19 22 23 23 15 19 16 17 20 12 14 22 21 19 25 22 25 21
## [26] 23 17 18 17 24

The formula of the Poisson law is: \[\mathbb{P}(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}\]

Note: factorial calculation in R:

factorial(5)
## [1] 120

Evaluation of the shape of a distribution

It is possible to measure the shape of a distribution thanks to the library moments:

library("moments")

Skewness: Asymmetry

The skewness moment gives information about the symmetry of the distribution:

  • 0 ==> no skewness: the distribution is perfectly symmetrical
  • >0 ==> the tail is on the right side of the distribution; the majority of the values are on the left compared to the mean, with outliers on the right
  • <0 ==> the tail is on the left side of the distribution; the majority of the values are on the right compared to the mean, with outliers on the left
virginica <- subset(iris, Species == "virginica")
skewness(virginica$Sepal.Length, na.rm = TRUE)
## [1] 0.1144447

This can be seen in the graph with the mean in red:

Kurtosis: Thickness of the tails

The kurtosis moment measures how heavy or flat the tails are:

  • = 3 ==> represents a normal distribution
  • > 3 ==> the distribution is leptokurtic: it tends to produce more outliers than the normal distribution
  • < 3 ==> the distribution is platykurtic: it tends to produce fewer and less extreme outliers than the normal distribution
virginica <- subset(iris, Species == "virginica")
kurtosis(virginica$Sepal.Length, na.rm = TRUE)
## [1] 2.912058

Graphics

Some graphics can be drawn in R without any package, but they are limited and offer few options to customize them. These graphs are useful for a first and fast exploration of the data, but for a presentation it is worth creating more complex/beautiful graphs. That is why it is better to work with the package ggplot2.

First of all, install the package ggplot2, then load it in R with the library() command:

library(ggplot2)

ggplot2 is defined at least by a data set, some aes, which represent the “aesthetics” of the graphic, and the type of the graphic given by a geom_ function. The aes is defined at least by x and y values. It is also possible to add the color, the fill, or the group. Some examples are shown below.

WARNING:
Every additional function is added to the plot with the + symbol.

For more information: https://bookdown.org/ael/rexplor/chap8.html (in French)

A graphic can be caught in a variable, but it is recommended to keep one single block per graphic:

# Determination of the limits
alpha=0.0001 # kind of "type I risk", the part of the plot I want to hide on the left/right sides.
xmin=qnorm(alpha,mean=my_mu,sd=my_sg) # left value of the range to plot the distribution
xmax=qnorm(1-alpha,mean=my_mu,sd=my_sg) # right value of the range to plot the distribution

# A beautiful plot of this normal law CDF
ggplot()+ # start plot
  xlim(xmin,xmax)+ # define the range to plot
  stat_function(fun=pnorm,args=c(mean=my_mu,sd=my_sg))+ #Define the CDF of a normal law
  stat_function(fun=pnorm,args=c(mean=my_mu,sd=my_sg),geom="area",xlim=c(xmin,20))+ # the same CDF, plotted as an area under the curve up to the value 20
  xlab("x")+ # decoration of the x axis
  ylab("P(X<=x)")+ # decoration of the y axis
  labs(title=paste("Normal law CDF - mean=",my_mu,", std dev=",signif(my_sg,3)))+ # title of the plot
  theme_bw(base_size=12)+ # ...the final touch
  theme(plot.title = element_text(hjust = 0.5, color = "blue"))

Here stat_function(fun=pnorm, args=c(mean=my_mu, sd=my_sg)) is used to draw a function (the CDF of a normal law in this case). The parameter fun= gives the function to draw, and args= takes the parameters of this function as arguments. Then a second curve is added with the parameter geom to display the area under the curve.

The point plot:

plot(Ca,  
  type = "p", # plot type: points ("p"), lines ("l"), both ("b" or "o")
  col = "blue", # colour, type `colours()` for the full list
  pch = 4, # symbol type, a number between 0 and 25, type `?points`
  cex = 0.5, # symbol size
  lty = 3, # line type, a number between 1 and 6
  lwd = 1.2, # line width
  ylab = "Ca content", # title of the y axis
  main="Exercise 3 of the Statistics lecture")
abline(h=mean(Ca)+3*sd(Ca), col="red") # Add a line on the graphic
text(15,200, "3 * Standard Deviation", col="red") # Add a text on the graphic

The abline & text functions add a line and a text respectively to the graphic.

With ggplot2, use geom_point():

# Example of data frame for a binomial law
df_binom<-data.frame(x=c(0:n),p=dbinom(x=c(0:n),size=n,prob=p))
# Generation of the graph
ggplot(data=df_binom)+
  aes(x=x,y=p)+
  geom_point()+
  geom_segment(aes(x=x,y=0,xend=x,yend=p))+
  theme_light()+
  xlab("Number of success, x")+
  ylab("Probability, p")+
  labs(title=paste("Density of the binomial law, size=",n,", prob=",p))

The Barplot

Basic R code to generate a barplot of a Poisson law:

# Poisson distribution parameter
lambda <- 3  # Lambda parameter

# Generate values from the Poisson distribution
x <- c(0:10)  # Possible values (adjust as needed)
probas <- dpois(x, lambda = lambda)  # PDF
 
# Create the barplot
barplot(probas, names.arg = x, col = "skyblue", main = "Poisson Distribution",
        xlab = "Number of events", ylab = "Probability", ylim = c(0, max(probas) + 0.1))

With ggplot2, use geom_bar():

ggplot(iris, aes(x = Species, y = Sepal.Length))+
  geom_bar(stat = "identity")+
  theme_test()

It is possible to have several groups in the barplot: just add the parameter fill = group2 in the aes(). By default the groups are stacked; to have the bars side by side, add position=position_dodge() to the geom_bar() parameters.
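A minimal sketch of a dodged barplot, using the built-in mtcars data set (this example is not from the original text):

```r
library(ggplot2)
# counts of cars per number of cylinders, split by number of gears
ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = position_dodge()) + # side by side instead of stacked
  theme_test()
```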

More complicated is the case of 2 continuous y variables. It is not possible at this stage with the geom_bar() function, which allows only one y for one x. To solve this issue, the data must be reorganized: a table with several y columns is transformed into a table with only two columns. This transformation can be done with the function gather(data, key, value, columns_to_gather) of the library tidyr.

  • data ==> data frame, for example iris
  • key ==> name of the grouping column
  • value ==> name of the value column (the y axis)
  • columns_to_gather ==> the y variables of the original data frame to be merged, separated by ,
# Formatage of the data thanks to the library tidyr
library(tidyr)
reformat <- gather(iris, key = "Part_flower", value = "Length", Sepal.Length, Petal.Length)
head(reformat)
# Work on the mean and Standard deviation
reformat.summary <- reformat %>%
  group_by(Part_flower, Species) %>%
  summarise( sd = sd(Length), len = mean(Length))
## `summarise()` has grouped output by 'Part_flower'. You can override using the
## `.groups` argument.
reformat.summary
# Plot 
ggplot(reformat.summary, aes(x = Species, y = len, fill=Part_flower))+
  geom_bar(stat = "identity", position=position_dodge(0.8), width = 0.7)+
  geom_errorbar(
    aes(ymin = len-sd, ymax = len+sd, group = Part_flower),
    width = 0.2, position = position_dodge(0.8)
    )+
  labs(title="Average of the Petal and Sepal Length as a function of the Species")+
  theme_test()

More inspiring barplot : https://www.datanovia.com/en/fr/lessons/ggplot-barplot/

The Histogram

Basic R code to generate a histogram of a normal law:

hist(rnorm(100,mean=my_mu,sd=my_sg), freq=F) 
curve(dnorm(x,mean=my_mu,sd=my_sg),from = xmin,to = xmax,ylab="density",add=T,col="red")

With ggplot2, use geom_histogram():

#data <- data.frame(x = rnorm(100,mean=my_mu,sd=my_sg))
ggplot(iris, aes(x= Sepal.Length))+
  geom_histogram(aes(y=after_stat(density),color=Species, fill = Species), # after_stat(density) normalizes the area to 1
                 bins = 50,
                 position = "identity",
                 alpha = 0.4,
                 )+ # bins: the number of intervals in the histogram
  geom_density(aes(color = Species), linewidth = 1)+ # show the density of each distribution
  stat_function(fun=dnorm,args=c(mean=mean(iris$Sepal.Length),sd=sd(iris$Sepal.Length)), # show the theoretical normal law for the global distribution
                color ="red", 
                linewidth = 0.1)+
  scale_color_manual(values = c("#00AFBB", "#E7B800","pink"))+ #use personalized color
  scale_fill_manual(values = c("#00AFBB", "#E7B800","pink"))+ #use personalized color
  theme_bw()

The Box-plot

The boxplot represents the distribution of the observations as a function of a group. It also shows the quartiles and the median.

The easiest way to get a boxplot is the basic R command: boxplot(quantitative_variable~qualitative_variable, labels). Between the quantitative_variable & the qualitative_variable there is a tilde (~).
Example:

boxplot(iris$Sepal.Length~iris$Species,ylab="Sepal Length",xlab="Species")

With ggplot2, use geom_boxplot():

p <- ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species, color = Species))+
  ylim(4,8)+
  ylab("Sepal Length")+ 
  geom_boxplot(alpha=0.5, # alpha for the transparency
               outlier.colour="red",
               outlier.shape=8, outlier.size=4 # note: outliers are shown twice here
               # use outlier.alpha=0 to not represent the outliers twice
               ) +
  stat_summary(fun = mean, geom="point", shape=18, size=5, color="white") +
  geom_point(position="jitter", alpha=.5)+
  theme_classic()
p

The function stat_summary(fun = mean) displays the mean value.

Multiple graphs

As shown above, it is possible to combine several graphics in one by adding geom_ functions. But sometimes the graphics need to be separated: this is called faceting. Two functions can be used:

  • facet_wrap(~qualitative_variable, nrow = number_of_rows)
  • facet_grid(row ~ column)
ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length, shape = Species, color=Species))+
  theme_bw()+
  geom_smooth(method="lm", formula= y~poly(x,3), se=TRUE, size = 0.5) +
  geom_point() +
  facet_wrap(~Species, nrow = 2)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length, shape = Species, color = Species))+
  theme_bw()+
  geom_point() + 
  facet_grid(~Species)

In the first faceted graph, the function geom_smooth(method="lm", formula= y~poly(x,3), se=TRUE, size = 0.5) was introduced. It adds a smooth regression curve showing the general trend of the dataset. Here the method used to draw this curve is a linear model fit, lm.

For more inspired graphs:
https://rstudio-pubs-static.s3.amazonaws.com/578122_5e69256788bb4dcca6157d2bcfa7694e.html
https://www.charlesbordet.com/fr/faire-beaux-graphiques-ggplot2/#it%C3%A9ration-4---ajouter-des-couleurs-pour-chaque-groupe


Statistics Test

Normality test

Q-Q Plot

First, we need to check whether the hypothesis that the distribution of the variable follows a normal law (the H0 hypothesis) is valid.

A rapid, graphical test is to check the Q-Q plot (droite de Henry in French):

qqnorm(iris$Sepal.Length, main = "Normal Q-Q Plot of the Global Sepal Length distribution")
qqline(iris$Sepal.Length, col="red")

Also with ggplot:

ggplot(iris, aes(sample = Sepal.Length, colour = Species)) +
  stat_qq() +
  stat_qq_line()+
  theme_light()+
  xlab("Theoretical Quantiles")+
  ylab("Sample Quantiles")+
  labs(title="Normal Q-Q Plot of the Sepal Length distribution for each Species")

The visual analysis alone is not sufficient to prove the normality of the distribution. It must be completed by a statistical test in order to check the validity of the H0 hypothesis. If the test is significant (p-value < 0.05), the distribution is not normal; if the test fails to reach significance (p-value > 0.05), the distribution can be considered normal.

Normality tests are sensitive to the sample size: small samples usually pass the test. That is why both the visual inspection AND the normality test must be performed.

Remark:
In reality, if the H0 hypothesis is NOT rejected, we cannot affirm that the sample follows a normal law: it is impossible to fully validate a theory, because too many alternatives remain possible. However, if the test rejects H0 under the given conditions, that conclusion stands (up to the chosen risk level).

Shapiro-Wilk test

In R, this test is performed with the command shapiro_test() included in the rstatix package (installed from CRAN with install.packages("rstatix"); base R also provides the equivalent shapiro.test()).

For One variable

library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
shapiro_test(iris, Sepal.Length)

Here p-value > 0.05 ==> the test is not significant ==> H0 is not rejected. The global distribution of the Sepal length is compatible with a normal law.

Variables classified by group

iris %>%
  group_by(Species) %>%
  shapiro_test(Sepal.Length)

Note: here the `tidyverse` style is used to manipulate the data (the pipe `%>%` and `group_by`)

For multiple variables

shapiro_test(iris, Sepal.Length, Petal.Width)

Here the p-value of the Petal width is < 0.05 ==> significant test ==> H0 rejected. The global distribution of the Petal width does not follow a normal law. This can be checked with a simple Q-Q plot:
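For example, reusing the base-R qqnorm/qqline commands from above (a sketch; the points bending away from the line reflect the non-normality):

```r
# Q-Q plot of the Petal width: the points deviate from the red line,
# consistent with the significant Shapiro-Wilk test
qqnorm(iris$Petal.Width, main = "Normal Q-Q Plot of the Petal Width distribution")
qqline(iris$Petal.Width, col = "red")
```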

Kolmogorov-Smirnov: a goodness-of-fit test

Here the normality test is a particular case of this more general test.

In general this test checks the hypothesis that a sample follows a given CDF (normal, binomial…) or that two samples follow the same law.

1) Creation of samples

a<-rnorm(100,mean=0,sd=1) # 100 draws from a normal law
b<-rgamma(100,shape=1,rate=0.8) # another law (gamma)
c<-rnorm(50,mean=0,sd=1) # the same normal law as a
d<-rnorm(50, mean=2, sd=1) # another normal law, with a different mean

2) Do a & b follow the same law ?

ks.test(a,b)
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  a and b
## D = 0.47, p-value = 5.099e-10
## alternative hypothesis: two-sided

The p-value is very small (< 0.05) ==> significant test ==> H0 rejected ==> the samples do not follow the same law

3) Does a follow a normal law ?

ks.test(a,"pnorm") # Standard normal law (mean 0 & sd 1 by default)
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  a
## D = 0.087583, p-value = 0.427
## alternative hypothesis: two-sided
ks.test(a,"pnorm",0,1) # A particular normal law defined by a mean 0 & a sd 1
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  a
## D = 0.087583, p-value = 0.427
## alternative hypothesis: two-sided
ks.test(a,"pnorm",2,1) # A particular normal law defined by a mean 2 & a sd 1
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  a
## D = 0.69373, p-value < 2.2e-16
## alternative hypothesis: two-sided

For the first two tests, p-value > 0.05: not significant ==> H0 is not rejected. For the last one, p-value < 0.05 ==> a does not follow this particular law.

4) Do a, c & d follow the same normal law ?

ks.test(a,c)
## 
##  Exact two-sample Kolmogorov-Smirnov test
## 
## data:  a and c
## D = 0.16, p-value = 0.3503
## alternative hypothesis: two-sided
ks.test(a,d)
## 
##  Exact two-sample Kolmogorov-Smirnov test
## 
## data:  a and d
## D = 0.79, p-value < 2.2e-16
## alternative hypothesis: two-sided

a & c are compatible with the same normal law; d is not.

The Fisher test: F-test

This test is performed to compare the equality of two variances. The H0 hypothesis is \(H_0:\sigma^{2}_{1}=\sigma^{2}_{2}\).

The Fisher F statistic is the ratio of the two sample variances, each computed as a sum of squares divided by its \(DF\) degrees of freedom: \[ F = \frac{s_1^2}{s_2^2}, \qquad s_i^2 = \frac{SS_i}{DF_i} \]

setosa <- subset(iris, Species == "setosa")
virginica <- subset(iris, Species == "virginica")
versicolor <- subset(iris, Species == "versicolor")
var.test(setosa$Sepal.Width, virginica$Sepal.Width,conf.level = 0.95) # with a bilateral risk of 5%
## 
##  F test to compare two variances
## 
## data:  setosa$Sepal.Width and virginica$Sepal.Width
## F = 1.3816, num df = 49, denom df = 49, p-value = 0.2614
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7840128 2.4346017
## sample estimates:
## ratio of variances 
##           1.381578

Here the p-value is greater than 0.05: H0 is not rejected.

var.test(virginica$Sepal.Length, virginica$Sepal.Width,conf.level = 0.95) # with a bilateral risk of 5%
## 
##  F test to compare two variances
## 
## data:  virginica$Sepal.Length and virginica$Sepal.Width
## F = 3.8878, num df = 49, denom df = 49, p-value = 5.014e-06
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  2.206211 6.850965
## sample estimates:
## ratio of variances 
##            3.88776

Here the p-value is well below 0.05. The H0 hypothesis is rejected, which means that the variances of these two variables are different.

WARNING:
This test should not be confused with Fisher's exact test or the Fisher test of independence, which are performed to compare the dependence of several groups:

  • Fisher's exact test is used for contingency tables cf. \(\chi^{2}\) test
  • The F-test of variance compares the variances of two different groups
  • The F-test of independence also uses the variance, but compares one group against all the other groups cf. ANOVA

Comparison of the mean

The Student test: t-test

This test is performed to compare the equality of two means. The H0 hypothesis is \(H_0:\mu_{1}=\mu_{2}\). In other words: do the samples come from the same population?

This test requires 3 conditions:

  1. n > 8
  2. The samples must follow a normal law (shapiro_test)
  3. The variances must be equal (F-test) ==> use the argument var.equal = TRUE
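As an illustration, these three conditions can be verified beforehand for the comparison performed below (a sketch using base-R shapiro.test and var.test; the setosa and virginica subsets are re-created here so the block is self-contained):

```r
setosa    <- subset(iris, Species == "setosa")
virginica <- subset(iris, Species == "virginica")

# 1. n > 8 in both samples
nrow(setosa); nrow(virginica)

# 2. Normality of each sample (p-value > 0.05 expected)
shapiro.test(setosa$Sepal.Width)$p.value
shapiro.test(virginica$Sepal.Width)$p.value

# 3. Equality of the variances (p-value > 0.05 expected, cf. the F-test above)
var.test(setosa$Sepal.Width, virginica$Sepal.Width)$p.value
```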

Then we can perform the t-test.

t.test(setosa$Sepal.Width, virginica$Sepal.Width,conf.level = 0.95, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  setosa$Sepal.Width and virginica$Sepal.Width
## t = 6.4503, df = 98, p-value = 4.246e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3143257 0.5936743
## sample estimates:
## mean of x mean of y 
##     3.428     2.974

Here the p-value is well below 0.05. The H0 hypothesis is rejected, which means that the means of these two categories are different.

If the two variances are not equal ==> use the Welch approximation with var.equal = FALSE

t.test(virginica$Sepal.Length, virginica$Sepal.Width, conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  virginica$Sepal.Length and virginica$Sepal.Width
## t = 35.842, df = 72.643, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.413027 3.814973
## sample estimates:
## mean of x mean of y 
##     6.588     2.974

The Wilcoxon test

When the distribution does not follow a normal law, or if n < 8, it is possible to use the Wilcoxon test:

wilcox.test(iris$Petal.Length, iris$Petal.Width, conf.level = 0.95)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  iris$Petal.Length and iris$Petal.Width
## W = 19349, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Note that the Wilcoxon test does not compare the means directly but works on the ranks of the values
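For a two-group comparison, the formula interface can also be used; a minimal sketch (an illustration beyond the example above) comparing the Sepal width of two species:

```r
# Keep only two species, then compare their Sepal width by ranks
two_species <- droplevels(subset(iris, Species %in% c("setosa", "virginica")))
wilcox.test(Sepal.Width ~ Species, data = two_species, conf.level = 0.95)
```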


Detection of Outliers

Outliers can be detected with a boxplot. However, the detection can also be backed by a statistical test. In R, the detection of outliers is performed thanks to the outliers library.

Dixon test: Q-test

The Dixon test can be used for small samples: it evaluates a suspect point by comparing the distance between that point and the measurement nearest to it with the range of the measurements. The H0 hypothesis is: the value is not an outlier.


The Dixon test checks one value at a time: if there are 2 or more outliers, the test must be repeated until the p-value is higher than 0.05.

# dataset
data_for_dixon <- c(-1000, 10, 12, 13, 14, 15, 16, 17, 18, 19,200.5)

# initialization of pvalue to enter into the loop
pvalue <- 0

# loop while until the p-value is > 0.05
while(pvalue < 0.05){
  #test
  result_dixon <- dixon.test(data_for_dixon)
  pvalue <- result_dixon$p.value # catch the p-value in variable
  
  # if pvalue is under 0.05 ==> remove the outlier
  if (pvalue <0.05){
    # catch the outlier in a variable:
    #     $alternative is a character string
    #     split it into words and select the 3rd one
    #     convert it into a numeric value
    val_outlier <- as.numeric(strsplit(result_dixon$alternative, " ")[[1]][3]) 
    
    # Remove the outlier
    data_for_dixon <- data_for_dixon[-which(data_for_dixon==val_outlier)]
  }
  # Printing
  print(result_dixon)
  print(data_for_dixon)
}
## 
##  Dixon test for outliers
## 
## data:  data_for_dixon
## Q = 0.99313, p-value < 2.2e-16
## alternative hypothesis: lowest value -1000 is an outlier
## 
##  [1]  10.0  12.0  13.0  14.0  15.0  16.0  17.0  18.0  19.0 200.5
## 
##  Dixon test for outliers
## 
## data:  data_for_dixon
## Q = 0.96286, p-value < 2.2e-16
## alternative hypothesis: highest value 200.5 is an outlier
## 
## [1] 10 12 13 14 15 16 17 18 19
## 
##  Dixon test for outliers
## 
## data:  data_for_dixon
## Q = 0.25, p-value = 0.7435
## alternative hypothesis: lowest value 10 is an outlier
## 
## [1] 10 12 13 14 15 16 17 18 19

The Grubbs test

When the dataset has more than 30 values, the Dixon test does not work:

  dixon.test(versicolor$Sepal.Length)
## An error occurred: Sample size must be in range 3-30

In this case, the Grubbs test is more useful. It also tests one value at a time, so to remove all the outliers we need to iterate as with the Dixon test before.
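A possible iteration, mirroring the Dixon loop above (made-up data with two planted outliers; assumes the outliers package is installed):

```r
library(outliers)

# made-up dataset (n > 30) with two planted outliers: 120 and -80
x <- c(rep(10:19, 3), 120, -80)

pvalue <- 0
while (pvalue < 0.05) {
  result_grubbs <- grubbs.test(x)
  pvalue <- result_grubbs$p.value
  if (pvalue < 0.05) {
    # the suspect value is the 3rd word of $alternative, as for dixon.test
    val_outlier <- as.numeric(strsplit(result_grubbs$alternative, " ")[[1]][3])
    x <- x[-which(x == val_outlier)]
  }
}
x  # 120 and -80 have been removed
```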

grubbs.test(versicolor$Sepal.Length)
## 
##  Grubbs test for one outlier
## 
## data:  versicolor$Sepal.Length
## G = 2.06133, U = 0.91151, p-value = 0.8977
## alternative hypothesis: highest value 7 is an outlier
grubbs.test(virginica$Sepal.Length)
## 
##  Grubbs test for one outlier
## 
## data:  virginica$Sepal.Length
## G = 2.65459, U = 0.85325, p-value = 0.1509
## alternative hypothesis: lowest value 4.9 is an outlier
boxplot(virginica$Sepal.Length)

Remark: here the Grubbs test does not find an outlier, whereas the box plot does. In fact the outliers flagged by a boxplot are not based on a true statistical test. That is why, in case of doubt, it is mandatory to perform a test.

ANOVA

The analysis of variance. The purpose of this test is to find which factors have an impact on the variance. This analysis is based on the Fisher test and is close to the F-test of variance, except that the comparison is not performed between two individual groups: it compares the variance between all the groups with the variance inside the groups:

\[ F = \frac{\frac{\sigma_{inter}^2}{DF_{inter}}}{\frac{\sigma_{intra}^2}{DF_{intra}}} \]

With \(\sigma_{inter}^2\) the between-group variance: \(\sigma_{inter}^2=\sum(mean_{group}-mean_{total})^2\)
And \(\sigma_{intra}^2\) the within-group variance: \(\sigma_{intra}^2=\sum(x_{i}-\overline{x_{group}})^2\)

ANOVA assumes that the observations are independent, follow a normal law, and contain no significant outlier. The H0 hypothesis: all the group means are equal.

One-way ANOVA

ANOVA, in R, is performed thanks to the command aov(observation~factor, dataset). The results can be shown with the command summary.aov.

species_anova <- aov(Sepal.Length~Species, iris)
summary.aov(species_anova)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here the p-value is under 0.05, so H0 is rejected: the factor Species has a significant impact on the Sepal length.
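To see which pairs of species actually differ (an extra step, not detailed above), base R provides the Tukey HSD post-hoc test applied to the aov object:

```r
# One-way ANOVA followed by pairwise comparison of the 3 species
species_anova <- aov(Sepal.Length ~ Species, iris)
TukeyHSD(species_anova, conf.level = 0.95)
```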

Two-way ANOVA

For this example, the dataset ToothGrowth is used, and it is assumed that all the conditions for the ANOVA hold. This dataset examines the tooth growth length in guinea pigs as a function of the dose and the supplementation type of vitamin C.

The two-way ANOVA, in R, is performed with the same command as the one-way one. The interaction between the two factors is added with the symbol * between them: aov(observation~factor1*factor2, dataset).

summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
ToothGrowth$dose <- as.factor(ToothGrowth$dose) # dose is numeric ==> convert it to a factor
summary(ToothGrowth)
##       len        supp     dose   
##  Min.   : 4.20   OJ:30   0.5:20  
##  1st Qu.:13.07   VC:30   1  :20  
##  Median :19.25           2  :20  
##  Mean   :18.81                   
##  3rd Qu.:25.27                   
##  Max.   :33.90
# 2-way ANOVA
tooth_anova <- aov(len~supp*dose,ToothGrowth)
summary.aov(tooth_anova)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## supp         1  205.4   205.4  15.572 0.000231 ***
## dose         2 2426.4  1213.2  92.000  < 2e-16 ***
## supp:dose    2  108.3    54.2   4.107 0.021860 *  
## Residuals   54  712.1    13.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here the p-value is under 0.05 for both the supplementation and the dose of vitamin C: H0 is rejected, these two factors have a significant impact. The interaction between the two factors is also significant (p-value < 0.05).

The two-way ANOVA can also be performed without the cross interaction. In this case the symbol * is replaced by +
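For instance, the additive model for the same data would be (a sketch following the chunk above; dose is converted to a factor again so the block is self-contained):

```r
data(ToothGrowth)
ToothGrowth$dose <- as.factor(ToothGrowth$dose)

# Two-way ANOVA without the supp:dose interaction term
tooth_anova_add <- aov(len ~ supp + dose, data = ToothGrowth)
summary.aov(tooth_anova_add)
```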

Graph analysis of the ToothGrowth with ggarrange()

# Step 1 creation of the theme
theme_set(
  theme_classic() +
    theme(legend.position = "top")
  )
my3cols <- c("#2E9FDF", "#FC4E07", "#E7B800")

# Prepare the summary data (mean and sd of len by dose and supp)
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
df.summary <- ToothGrowth %>%
  group_by(dose,supp) %>%
  summarise(sd = sd(len, na.rm = TRUE), len = mean(len))

# Step 2 Creation of the different graphs

# 1- The box plot
bxp <- ggplot(ToothGrowth, aes(x = dose, y = len))+ geom_boxplot(aes(color = supp)) +
  scale_color_manual(values = my3cols)

# 2- line plot
lin <- ggplot(df.summary, aes(dose, len)) +
  geom_line(aes(linetype = supp, group = supp, color=supp))+
  geom_point()+
  geom_errorbar(
    aes(ymin = len-sd, ymax = len+sd, group = supp, color=supp),
    width = 0.2
  )+
  scale_color_manual(values = my3cols)

# Step 3: pool the graphs together 

library("ggpubr")
## Warning: package 'ggpubr' was built under R version 4.3.2
figure <- ggarrange(bxp, lin,
                    labels = c("A", "B"),
                    ncol = 2, nrow = 1,
                    common.legend = TRUE,
                    legend = "right")
figure

Principal Component Analysis (PCA)

The PCA is an unsupervised analysis used to reduce the number of variables. Indeed, with more than 3 dimensions it becomes complicated to represent the information graphically, so the goal of the PCA is to reduce the number of dimensions in order to get a better visualization.

Two packages are recommended (developed by French teams):

  • FactoMineR -> for the statistical methods
  • factoextra -> for the visualization

library("FactoMineR")
## Warning: package 'FactoMineR' was built under R version 4.3.2
library("factoextra")
## Warning: package 'factoextra' was built under R version 4.3.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library("ggpubr")

To work on the PCA, the dataset decathlon is used (created for this purpose).

decathlon<-read.table("https://r-stat-sc-donnees.github.io/decathlon.csv",
                      header=TRUE,sep=";",dec=".",
                      row.names=1,check.names=FALSE,fileEncoding="latin1")
head(decathlon)

PCA of the individuals

# Make the PCA 
pca.decat <- PCA(decathlon[,-c(11:13)], graph=FALSE) # keep only the variables of interest

Then make the graph with fviz_pca_ind

graph1 <- fviz_pca_ind(pca.decat, axes = c(1,2), #plot with the dimension selected
                       pointshape = 20,
                       pointsize = 4,
                       geom.ind = "point",
                       mean.point = F, #hide the point of the mean
                       col.ind = decathlon$Competition,
                       addEllipses = FALSE)
ggpar(graph1, title = "PCA of the Decathlon Dataset",
      subtitle = expression(paste("Plot by Individuals", sep = "")), # expression(paste) allows subscripts [] or superscripts ^
      caption = "",
      legend = "right",
      legend.title = "Competition")

It could be interesting to look at the scree plot:

# Scree plot
fviz_eig(pca.decat, addlabels =TRUE)
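The variables can be examined the same way with fviz_pca_var from factoextra (the correlation circle); a minimal sketch, assuming pca.decat computed above:

```r
library("FactoMineR")
library("factoextra")

# Correlation circle of the variables: variables pointing in the same
# direction are positively correlated; coloring by contribution to the axes
fviz_pca_var(pca.decat, col.var = "contrib", repel = TRUE)
```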


3D plot

library("rgl")
# Convert the group to a factor
decathlon$Competition <- factor(decathlon$Competition)


#  3 D dans R
plot3d(pca.decat$ind$coord[,1],
       pca.decat$ind$coord[,2],
       pca.decat$ind$coord[,3],
       xlab="Dim1",ylab="Dim2",zlab="Dim3",
       col = rainbow(length(levels(decathlon$Competition))),
       type="s",radius=0.1)

# Add legend
legend3d("topright", legend = levels(decathlon$Competition),
         col = rainbow(length(levels(decathlon$Competition))),
         pch = 16, cex = 1.2)

Remark: a new window will appear.

With the code below and the GIMP software (or an online GIF creator website like here) it is possible to create an animation of the 3D picture. The 3D plot above must be created first.

setwd("C:/Users/hpier/Documents/code/R/ChemoinfoR_tuto/3D")
for (i in 1:360) {
  rgl.viewpoint(i, 20)
  filename <- paste("pic", formatC(i, digits = 1, flag = "0"), ".png", sep = "")
  rgl.snapshot(filename)
}
getwd()

Remark: rgl.viewpoint is deprecated; its replacement is view3d().

GIF 3D of the Decathlon PCA

Linear Regression

Creation of the Linear model

A linear regression \(y=f(x)\) can be implemented thanks to the following function: lm(y~x, data=my_dataset). It is based on the least squares method.
To force the line through the origin, simply add -1 after the x term.

my_model <- lm(data=iris, Sepal.Length~Sepal.Width)
summary(my_model)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5561 -0.6333 -0.1120  0.5579  2.2226 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.5262     0.4789   13.63   <2e-16 ***
## Sepal.Width  -0.2234     0.1551   -1.44    0.152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8251 on 148 degrees of freedom
## Multiple R-squared:  0.01382,    Adjusted R-squared:  0.007159 
## F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519
# Force the line through the origin 
my_model0 <- lm(Sepal.Length~Sepal.Width-1, data=iris)
summary(my_model0)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width - 1, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5236 -1.0362  0.4823  0.9897  2.8406 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## Sepal.Width  1.86901    0.03265   57.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.235 on 149 degrees of freedom
## Multiple R-squared:  0.9565, Adjusted R-squared:  0.9562 
## F-statistic:  3277 on 1 and 149 DF,  p-value: < 2.2e-16

  • t value = estimate / standard error ==> must be high
  • Pr(>|t|) = probability of observing such a t value if the true coefficient were zero ==> must be small

Again we can look at the p-value.
In this case my_model0 is better than my_model.

Exploitation of the model

Graph

plot(Sepal.Length~Sepal.Width, data = iris, xlim=c(-10,5), ylim=c(0,10))
abline(my_model, col="red")
abline(my_model0, col="blue")

Statistical analysis of the model

# Catch the stats of the model in a variable
Stat_myModel<-summary(my_model)

# Catch the intercept (b) and its standard error
interception <- Stat_myModel[["coefficients"]][1,1]
error_std <- Stat_myModel[["coefficients"]][1,2]

# Analysis of the confidence interval
t95<-qt(0.975, df = my_model$df.residual) # Student's bilateral quantile at 5% risk (148 df here)
ICmin_int<-interception - t95 * error_std # lower bound of IC(95) of intercept
ICmax_int<-interception + t95 * error_std # upper bound of IC(95) of intercept

# Print results
print(paste("Interception =",signif(interception,3), "in the interval of [",signif(ICmin_int,3),";",signif(ICmax_int,3),"]."))
## [1] "Interception = 6.53 in the interval of [ 5.58 ; 7.47 ]."
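As a cross-check (not in the original script), base R can compute this interval directly with confint():

```r
# 95% confidence intervals for the intercept and the slope
my_model <- lm(Sepal.Length ~ Sepal.Width, data = iris)
confint(my_model, level = 0.95)
```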

Prediction

To predict a new \(Y\) from a new value \(x\):

p1 <- predict.lm(my_model, tibble("Sepal.Width"=c(2.7)))
p0 <- predict.lm(my_model0, tibble("Sepal.Width"=c(2.7)))

print(paste("Model 1: Y_pred =", signif(p1,3),
            " | Model 2 (through 0): Y_pred =",signif(p0,3)))
## [1] "Model 1: Y_pred = 5.92  | Model 2 (through 0): Y_pred = 5.05"

To predict a new \(x\) from \(Y\), it is not possible to use predict.lm directly. Instead, compute x from the coefficients with: \(Y=ax+b \Leftrightarrow x=\frac{Y-b}{a}\)

coef_model <- coef(my_model)
coef_model0 <- coef(my_model0)

Y_new <- 8.3
x_pred_1 <- (Y_new-coef_model[1])/coef_model[2]
x_pred_2 <- (Y_new)/coef_model0[1]

print(paste("Model 1: x_pred =", signif(x_pred_1,3),
            " | Model 2 (through 0): x_pred =",signif(x_pred_2,3)))
## [1] "Model 1: x_pred = -7.94  | Model 2 (through 0): x_pred = 4.44"

Loops and Comparison tests

Loop for

for (i in 1:5){
  print(paste("Line#", i, sep=" "))
}
## [1] "Line# 1"
## [1] "Line# 2"
## [1] "Line# 3"
## [1] "Line# 4"
## [1] "Line# 5"

Test if

The command if checks conditions:

  • == for an equivalence
  • != for a difference
  • <= / >= less/greater than or equal to
  • < / > strictly less/greater than
  • & / | (vectorized) or && / || (scalar) AND / OR ==> for multiple conditions
  • isTRUE(x) tests if x is TRUE
  • na.rm=TRUE (an argument of many functions) to ignore missing values
if (condition 1){
  action 1
}else if ((condition 2) || (condition 3)){
  action 2
}else{
  other things
}
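A concrete, runnable version of this template (the threshold values are arbitrary, just for illustration):

```r
x <- 7

if (x > 10){
  print("big")
}else if ((x %% 2 == 0) || (x < 0)){
  print("even or negative")
}else{
  print("small, positive and odd")
}
## [1] "small, positive and odd"
```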